M00: Core Prompting & Context Engineering
This foundational module covers the physical, economic, and computational constraints of prompting LLMs in production systems. You will learn to optimize context windows, configure model parameters, and prevent runtime failures.
1. Architectural Deep Dive: Attention & KV Caching
When designing production agent loops, prompts are not just strings. They are tokenized into sequences that the transformer's attention layers process, so their length and layout directly affect memory use, latency, and cost.
KV Cache & Context Attention
During inference, the model stores key-value (KV) states of previous tokens in memory (KV Cache) to avoid re-evaluating context at every token generation step.
- VRAM Overhead: The size of the KV Cache scales linearly with context length and number of concurrent requests. Large contexts pressure GPU VRAM, increasing latency and TTFT (Time To First Token).
- Attention Drift: As context length grows, attention is distributed across more tokens, and the model becomes more likely to ignore instructions or constraints placed in the middle of the prompt (the "Lost in the Middle" phenomenon).
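To make the linear scaling concrete, the sketch below estimates KV cache size from a model's shape. The formula (one key and one value vector per token, per layer) is the usual back-of-the-envelope estimate; the model dimensions in the example are hypothetical and not taken from any specific provider.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Each token stores one key and one value vector per layer, hence the
    leading factor of 2. bytes_per_value=2 assumes fp16/bf16 storage.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 values,
# serving 4 concurrent requests with 32,768 tokens of context each.
size = kv_cache_bytes(32, 8, 128, seq_len=32_768, batch_size=4)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Doubling either the context length or the number of concurrent requests doubles the estimate, which is why long shared prefixes are the first thing worth optimizing.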
Prompt Caching Internals
To avoid recomputing the same KV states on every request, providers such as Google Gemini and Anthropic Claude can cache an identical leading prompt prefix at the API level.
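- Cache Hits: Caching only applies to prefixes above a minimum size (e.g. 1,024 tokens for Claude, 32,768 tokens for Gemini). These thresholds vary by model and have changed over time, so confirm them against current provider documentation.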
- Byte-for-Byte Match: The cached prefix (system prompts, database schemas, code libraries) must be exactly identical. A single character change (including whitespace) invalidates the cache.
- Token Economics: Cached input tokens are charged at a significantly reduced rate (up to 90% cheaper than standard input tokens).
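The rules above suggest two habits: keep the static prefix first and byte-identical across requests, and estimate what a cache hit is worth. The following is a minimal sketch under stated assumptions. The helper names are hypothetical, the 90% discount is the upper bound quoted above, and the price per token is a made-up figure. It also ignores any cache-write surcharge or storage fee a provider may apply.

```python
import hashlib


def build_prompt(static_prefix: str, dynamic_suffix: str) -> str:
    """Put cacheable content first and per-request content last."""
    return static_prefix + dynamic_suffix


def prefix_fingerprint(static_prefix: str) -> str:
    """Hash the prefix so accidental changes (even whitespace) are visible."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()


def input_cost(cached_tokens, uncached_tokens, price_per_token,
               cached_discount=0.90):
    """Estimate input cost when cached tokens are billed at a discount."""
    cached_price = price_per_token * (1.0 - cached_discount)
    return cached_tokens * cached_price + uncached_tokens * price_per_token


schema = "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);\n"
# A single trailing space is enough to change the fingerprint.
print(prefix_fingerprint(schema) == prefix_fingerprint(schema + " "))  # False

# Hypothetical price: 1e-6 currency units per input token.
print(input_cost(40_000, 500, 1e-6))   # cache hit on a 40k-token prefix
print(input_cost(0, 40_500, 1e-6))     # same request with no cache hit
```

Logging the fingerprint alongside each request is a cheap way to detect when a deploy or template change has silently altered the prefix.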
2. Tradeoff Matrix: Context Optimization Methods
| Method | Latency (TTFT) | VRAM Footprint | Token Cost | Output Consistency | Primary Production Bottleneck |
|---|---|---|---|---|---|
| Zero-Shot Prompting | Ultra-Low (< 200ms) | Negligible | Very Low | Brittle / Low | Hallucinations on structured formats |
| Few-Shot XML Prompting | Moderate (~500ms) | Low | Low | Very High | Token inflation from repetitive examples |
| Context Caching | Low (after 1st run) | High (GCP managed) | Ultra-Low (90% off) | High | Cache eviction cycles on long idle states |
| Context Pruning | Moderate | Low | Low | High | Information loss from aggressive compression |
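As an illustration of the context pruning row, here is a minimal sketch that keeps the system prompt plus as many of the most recent messages as fit a token budget. The function names are hypothetical, and the 4-characters-per-token estimate is a crude assumption; a production system would use the provider's tokenizer or token-counting endpoint.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def prune_history(system_prompt, messages, budget_tokens):
    """Keep the newest messages that fit the budget after the system prompt.

    Older messages are dropped first, which is where the information loss
    listed in the table comes from.
    """
    remaining = budget_tokens - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(prune_history("s" * 20, history, budget_tokens=26)))  # prints 2
```

Dropping whole messages is the simplest policy. Summarizing the dropped span instead trades extra model calls for less information loss.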
3. Step-by-Step Mechanics: Structured Prompts & Tuning
To get predictable, machine-parseable output, we use a structured format that combines XML tags with low-temperature parameter tuning.
1. XML Structured Markup
XML tags act as clear boundaries, preventing the LLM from confusing system instructions with user-submitted data payloads:
```xml
<system_instructions>
You are an expert PostgreSQL database administrator.
Your goal is to output SQL statements based on user schemas.
</system_instructions>
<constraints>
- Output raw SQL only.
- Do NOT include explanation blocks.
</constraints>
<schema_context>
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);
</schema_context>
<user_query>
Add an email column to the users table.
</user_query>
```
2. Parameter Tuning Configurations
Set these parameters in your API config payloads:
- temperature = 0.0: Makes decoding effectively greedy, so the model selects the highest-probability token at each step. Use this for coding and JSON serialization tasks, where sampling variation tends to cause syntax and formatting failures.
- max_output_tokens: Must be set with a safety buffer (e.g. 2048 or 4096). If set too low, outputs truncate mid-sentence, causing JSON parsing crashes.
- top_p & top_k: Leave at the provider default, or set top_p to 1.0, if temperature = 0.0. If tuning a reasoning agent, keep top_p at 0.95 to allow minor variation while pruning low-probability nonsense tokens.
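Putting both steps of this section together, the sketch below assembles the XML-tagged prompt and pairs it with a low-temperature configuration. It is illustrative only: `build_request` and the payload keys are hypothetical, and real SDKs name these fields differently (for example, some use `max_tokens` rather than `max_output_tokens`). User input is escaped so it cannot close the `<user_query>` tag early.

```python
from xml.sax.saxutils import escape


def build_request(instructions, constraints, schema, user_query,
                  max_output_tokens=2048):
    """Build an XML-delimited prompt plus a deterministic-leaning config."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    prompt = (
        f"<system_instructions>\n{instructions}\n</system_instructions>\n"
        f"<constraints>\n{constraint_lines}\n</constraints>\n"
        f"<schema_context>\n{schema}\n</schema_context>\n"
        # Escape &, < and > so user text cannot inject or close tags.
        f"<user_query>\n{escape(user_query)}\n</user_query>"
    )
    generation_config = {
        "temperature": 0.0,
        "top_p": 1.0,
        "max_output_tokens": max_output_tokens,
    }
    return {"prompt": prompt, "generation_config": generation_config}


request = build_request(
    instructions="You are an expert PostgreSQL database administrator.",
    constraints=["Output raw SQL only.", "Do NOT include explanation blocks."],
    schema="CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);",
    user_query="Add an email column to the users table.",
)
print(request["prompt"])
```

Static parts (instructions, constraints, schema) come first and the user query comes last, which also keeps the prefix stable for prompt caching.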
4. Failure Mode Analysis: Mitigating Prompt Failures
| Failure Mode | Log Signature / Error | Root Cause | Code Mitigation |
|---|---|---|---|
| JSON Parse Crash | json.decoder.JSONDecodeError | Output truncated due to low max_output_tokens. | Increase max_output_tokens or implement Pydantic validation retries. |
| Attention Loss | Agent ignores negative constraints. | Context window overflow or instruction placed in the middle. | Wrap target contents in XML tags; place core constraints at the end of the prompt. |
| Cache Invalidation | Increased token billing on consecutive runs. | Prompt prefix changed (dynamic timestamps, variables, or whitespace). | Place all dynamic inputs (user query, runtime variables) at the absolute bottom of the payload. |
| Rate Limiting | ResourceExhausted (429) | Exceeded provider TPM/RPM constraints. | Implement exponential backoff retry loops using tenacity. |
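The JSON parse and rate limiting mitigations in the table can be sketched with the standard library alone. This is a hand-rolled stand-in, not the tenacity-based loop the table recommends: `RateLimitError` is a hypothetical placeholder for whatever exception your provider SDK raises on a 429, and `generate` is any zero-argument callable that returns the model's raw text.

```python
import json
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's 429 / ResourceExhausted error."""


def call_with_retries(generate, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Call generate() until it returns parseable JSON or attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return json.loads(generate())
        except RateLimitError as error:
            last_error = error
            if attempt < max_attempts - 1:
                # Exponential backoff with jitter: about 1s, 2s, 4s, ...
                sleep(base_delay * (2 ** attempt)
                      + random.uniform(0, base_delay))
        except json.JSONDecodeError as error:
            # Retry immediately: waiting does not repair malformed output.
            last_error = error
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Retrying only helps with transient failures. If the output is truncated because `max_output_tokens` is too low, every retry will fail the same way, so raise the limit first.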
5. Runtime Verification: What to Observe
To verify your prompt design and cache behavior:
Test 1: Cold vs. Hot Run Verification
- Launch the Gemini CLI or execute a Python script loading a large (35k token) codebase context.
- Observe TTFT:
  - First Run (Cold): Latency will spike (~5-8s) as the API gateway compiles and caches the KV states.
  - Second Run (Hot): Latency should drop to <1.5s, indicating a successful cache hit.
- Audit the API request log. Confirm that the billing output logs list the cached input token counts matching your codebase context size.
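A small timing harness makes the cold versus hot comparison repeatable. The sketch below assumes a hypothetical `stream_response` callable that sends the same large-context request each time and returns an iterator of streamed chunks. Wiring it to a real SDK is left out, because the streaming call differs per provider.

```python
import time


def time_to_first_token(stream_response):
    """Return seconds elapsed until the first streamed chunk arrives."""
    start = time.perf_counter()
    for _chunk in stream_response():
        return time.perf_counter() - start
    raise RuntimeError("stream produced no chunks")


def compare_cold_hot(stream_response):
    """Issue the same request twice and report both TTFT measurements."""
    cold = time_to_first_token(stream_response)
    hot = time_to_first_token(stream_response)
    speedup = cold / hot if hot > 0 else float("inf")
    return {"cold_s": cold, "hot_s": hot, "speedup": speedup}
```

A single pair of measurements is noisy. Repeating the hot run several times and comparing medians gives a more trustworthy signal than one sample, and the cached token count in the usage log remains the authoritative evidence of a cache hit.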
Test 2: XML Parameter Tuning
- Run a script that requests structured JSON using a temperature of 0.8.
- Perform 50 consecutive runs. Count the number of runs where the output fails to parse (e.g., trailing commas, missing brackets).
- Change temperature to 0.0 and repeat. Confirm that the rate of formatting errors drops, ideally to 0%.
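The counting step can be automated with a few lines. In this sketch, `generate` is a hypothetical zero-argument callable that returns one raw model response per call. Run it once per temperature setting and compare the two rates.

```python
import json


def parse_failure_rate(generate, runs=50):
    """Return the fraction of runs whose output is not valid JSON."""
    failures = 0
    for _ in range(runs):
        try:
            json.loads(generate())
        except json.JSONDecodeError:
            failures += 1
    return failures / runs
```

For example, you might call `parse_failure_rate(lambda: call_model(temperature=0.8))` and then again with `temperature=0.0`, where `call_model` is your own wrapper around the provider SDK.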